ABSTRACT

Theoretically preferred item response theory (IRT) 
bias detection procedures were applied to both a mathematics 
achievement and vocabulary test. The data were from black seniors and 
white seniors on the High School and Beyond data files. We wished to 
account for statistical artifacts by conducting cross-validation or 
replication studies. Therefore, each analysis was repeated on 
randomly equivalent samples of blacks and whites (n's=1500). 
Furthermore, to establish a baseline for judging bias indices that 
might be attributable only to sampling fluctuations, bias analyses 
were conducted comparing randomly selected groups of whites. Also, to 
assess the effect of mean group differences on the appearance of 
bias, pseudo-ethnic groups were created, i.e., samples of whites were 
selected to simulate the average black-white difference. The validity 
and sensitivity of the IRT bias indices were supported by several 
findings. The pattern of between-study correlations showed high
consistency for parallel ethnic analyses where bias was plausibly
present. Also, the indices met the discriminant validity test.
Overall, the sums-of-squares statistics (weighted by the inverse of
the variance errors) were judged to be the best indices for
quantifying item characteristic curve differences between groups.
(PN)

The question of bias in tests of achievement and mental ability
is one of the most prominent issues in both the technical psychometric
literature and the more general public concerns about education. Tests
are used to make many significant decisions about individuals, for example:
to measure school achievements, to assign children to special classes, and
to select applicants for prestigious schools and professions. Tests
are intended to be more reliable and impartial than subjective judgments
about individuals; but if a test is invalid, the decisions derived from
it will be unfair. Bias is a kind of invalidity that is
disadvantageous to members of certain subgroups that take the test. If
a test is biased, critical educational and life decisions may be made
unfairly for members of those subgroups.



¹Paper presented at the annual meeting of the American Educational
Research Association, Montreal, April 1983.

²We wish to thank the Council on Research and Creative Work and
Dean Richard Turner, School of Education, University of Colorado, for their
financial support.

Scholars have taken three general approaches to research on the
question of test bias (see Cole, 1981; Bond, 1981): (1) predictive validity
studies in selection situations; (2) investigation of external biasing 
factors such as the race of examiner, test-wiseness of examinees, and speed 
of administration; and (3) construct or content validity studies of the 
internal structure of the test. The present research is focused on test 
item-bias methods, which are subsumed in the last category of inquiry. 
Item-bias methods are statistical procedures intended to test whether 
items function equivalently in two groups. Therefore, they address the 
basic validity question: does the test (or individual items in the test) 
measure what it purports to measure for both groups? 

Numerous item-bias methods exist (see Berk, 1982; Rudner, 1980; 
Shepard, 1981). Most rely on an item-by-group-interaction criterion 
of bias; that is, statistical adjustments are made for overall group 
differences, and then items that are relatively more difficult for one 
group are flagged as potentially biased. For example, an early method 
proposed by Angoff (1972) was based on transformed item difficulties. 
The proportion getting each item correct was computed separately for two 
groups, e.g., blacks and whites. Then, (after a transformation to 
linearize the relationship) the scattergram depicting the relationship 
between the two sets of p-values was examined. Most of the data points 
would fall along a line from the lower left to the upper right-hand corner 
of the graph, indicating test items that ranged from very easy in both 
groups to very difficult. Items which deviated substantially from this
principal axis line were those that were relatively more difficult in one
group than the other, and hence possibly biased.
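
As a rough illustration of the transformed item difficulty procedure just
described, the following Python sketch converts p-values to a normal-deviate
difficulty scale and computes each item's perpendicular distance from the
principal axis line. The particular transformation (13 minus 4 times the
inverse-normal of p, i.e., the delta scale) and the function names are
assumptions made for this example; the description above says only that a
linearizing transformation was applied.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Normal-deviate difficulty for a proportion correct p (harder items get larger values)."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)


def principal_axis_distances(p_group1, p_group2):
    """Signed perpendicular distance of each item from the principal axis line."""
    x = [delta(p) for p in p_group1]
    y = [delta(p) for p in p_group2]
    mx, my = mean(x), mean(y)
    sxx = mean([(v - mx) ** 2 for v in x])
    syy = mean([(v - my) ** 2 for v in y])
    sxy = mean([(u - mx) * (v - my) for u, v in zip(x, y)])
    # Slope and intercept of the major (principal) axis of the scatter of points.
    slope = ((syy - sxx) + sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = my - slope * mx
    norm = sqrt(slope ** 2 + 1.0)
    # Negative distance: the item lies above the line, i.e., it is
    # relatively harder for group 2 than the overall trend predicts.
    return [(slope * u - v + intercept) / norm for u, v in zip(x, y)]
```

Items with the largest absolute distances would be the ones flagged for
inspection; the sketch sets no cutoff, since the choice of a threshold is a
separate judgment.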






A standard operating assumption should be discussed regarding
item-bias techniques. Because they lack an external criterion, they can 
only be used to detect relative bias, not pervasive bias in a test (Petersen, 
1977). As in the above example, the various methods either use total 
test score (or estimated abilities from the total set of items), or 
average p-value differences to define the "typical" difference between
groups; this then becomes the standard of "unbiasedness" against which 
individual items are compared. Thus, if there is bias in the determination 
of this typical group difference, it will go undetected by these techniques. 
Despite this limitation, it has been argued that item-bias procedures may
be the preferred approach for understanding the nature of bias and for
uncovering irrelevant difficulties in items which change their meaning
for members of different groups (Shepard, 1982). The predictive validity
models of test fairness involve an external criterion but are not without 
fault. Petersen and Novick (1976) demonstrated that the various models 
for defining equal regressions (i.e., equal predictive validity for two 
groups) are mutually contradictory. Moreover, Linn (1982) has recently 
explained how differential measurement error for two groups could obscure 
predictive bias. Finally, there is the actuarial problem (Shepard, 1982).
Predictive validity studies look only at the magnitude of the correlation 
between test and criterion; they do not distinguish between relevant and 
irrelevant sources of relationship. Nor do they examine whether the
combinations of predictors that maximize the correlation are equally
defensible. Test-item bias methods let us look more directly at what we
are measuring. They leave for a second step the question of how measures
of separate traits should be combined to make selection or other test-based
decisions.

Among item-bias techniques, the theoretically preferred method
is the three-parameter item response theory (IRT) method, also called
the three-parameter latent-trait or item-characteristic-curve (ICC) method.
It is preferred because of its sample-invariant properties that make it
less likely that true group differences will be mistaken for bias.
Hunter (1975) and Lord (1977) have demonstrated heuristically that bias
techniques based on classical test theory (such as p-value differences
or point-biserials) will produce invalid indices of bias in the presence
of group mean differences. Because p-value differences interact with item
discrimination, items that are merely more discriminating (i.e., better
measures of the trait in both groups) will have bigger differences in
performance. Furthermore, the variability of a particular group, and how
"centered" an item is for that group, will artifactually control the
item's discriminating power. Methods such as empirical ICCs (Green &
Draper, 1972) and chi-square procedures (Scheuneman, 1979) were intended
to be approximations to the IRT method. These procedures are crude, however,
and will still confound real group differences with bias because of
regression effects. The one-parameter latent-trait method (or Rasch model)
shares the theoretically sample-invariant properties of the three-parameter
model. However, the Rasch model is not recommended for bias
detection because it will confound other sources of model misfit (particularly
differences in item discrimination) with item "bias" (see Divgi,
1981; Ironson, 1982; Shepard, Camilli & Averill, 1981).

The conceptual definition of bias using the three-parameter IRT
model and specific procedures will be explained in the method section. The
three-parameter IRT method is applied in this paper because it is
theoretically the most sound. Its superiority is relative, however, rather 
than absolute; bias detection using IRT is not without problems. First 
there are estimation problems due to sampling fluctuations. Even with 
reasonably large sample sizes, as in this study, it is possible that 
misestimated item parameters for separate groups could create or obscure 
item-characteristic curve differences when the two groups are compared. 
More importantly there may be larger sources of error when samples are 
very different. Even the theoretical claims for the model are said to be 
true only when the model holds. The following discussion is focused on 
the potential for obtaining invalid bias indices, even with IRT methods. 
First, however, a digression is in order regarding substantive interpretation
of bias indices. Difficulties encountered when trying to make
substantive interpretations of bias analyses may be linked to the problem 
of statistical artifacts. 

Substantive Interpretations of Bias 

Given increasing concern over cultural bias in tests, a strong 
impetus to the development of statistical screening techniques was the 
apparent failure of judgmental methods for identifying biased test questions. 
That is, even minority experts, sensitive to the issue of cultural loading 
in test questions, could not predict with better than chance success what 
type of items would be difficult for members of particular groups (Jensen, 
1977; Plake, 1980; Sandoval & Miille, 1980). For example, Jensen (1976) 
found that an item on the WISC often cited for its dependence on white 
middle-class values, "What is the thing to do if a fellow (girl) much
smaller than yourself starts to fight with you?" was actually relatively
easier for blacks. If biased test questions were not obvious to expert
judges, then perhaps statistical detection procedures could uncover more
subtle changes in the meaning of items for different groups.

A more disappointing result, after numerous statistical bias
studies, has been that here too expert judges are often at a loss to
explain the source of bias in items with large bias indices. For instance,
in an early study Lord (1977) found that 46 of 85 items on the verbal
SAT were significantly different for blacks and whites (bias was sometimes
against whites). But in studying the items identified as biased, no
particular insights could be gained to explain the differential performance.
It had been hoped that the use of statistical bias techniques would lead
to substantive generalizations about the nature of items found to be
biased against specific groups. For example, Scheuneman (1979) found
that negatively worded items were biased against blacks. This type of
consistent finding turned out to be more the exception than the rule.
Raju (in Green et al., 1982) described the serious problems faced by test
publishers who may decide to discard statistically deviant items even though
they are unable to explain why they are biased "in terms of the content."
Scheuneman (1982) best summarized researchers' disappointment with initial
efforts to interpret bias:

We naively assumed that a review of such items would readily reveal 
the source of apparent bias, that the problem could then be easily 
corrected with suitable modifications or by dropping the item from 
the test or item pool, and that a 'debiased' instrument would result
(p. 180). 

The disconcertingly large number of uninterpretable statistically biased
items leaves the test maker with a dilemma: Has the statistical indicator
uncovered a real instance of bias, revealing a blind spot in the
conceptualization of the test construct, or is the large bias index a statistical
artifact, that is, not a valid sign of bias (see Shepard, 1981)? We
are aware of the potential for artifactual errors in the bias methods.
These artifactual explanations become all the more plausible when the bias
results seem uninterpretable.

Control of Statistical Artifacts 

There are both random and systematic sources of error associated
with IRT bias indices. For example, the current statistical theory for maximum
likelihood estimation in item response theory is only approximate; conclusions
regarding group differences may be sample dependent (Bougon and Lissak, 1981;
Lord, 1980). Lord (1980), in fact, proposed that replication or reliability
studies should be carried out on independent but randomly equivalent groups
of blacks and whites. One purpose of the present research is to conduct such
stability comparisons.

There is also some "art" involved in the implementation of IRT
procedures. Choices made in running computer programs to arrive at 
maximum likelihood estimates can have small but important effects. In the 
use of IRT specifically for studying item bias, a difficult stage in the 
procedures is the equating phase. As we will see, parameters must be 
estimated separately for two groups but then equated to the same scale for 
comparative purposes. Errors in the equating can produce spurious 
instances of bias. In simulation studies, for example, three-parameter
indices of bias only correlated on the order of .80 with generated bias
(see Merz & Grossen, 1979; Rudner, Getson, & Knight, 1980). Since the
three-parameter logistic model was used to create bias in these data
initially, near perfect correlations might have been expected between
simulated and detected bias. One must conclude that either sampling
fluctuations or some implementation problem, as suggested above, prevented
better "recovery" of the bias that had been built in.

In addition to the replication or "cross-validation" method to
control for unreliability, the degree of error in bias indices can also
be assessed by means of baseline studies. Lord (1980) created random
groups which he called "reds" and "blues" to check on the number of
"significantly" biased items in a condition where there should be no bias.
Similarly, Ironson and Subkoviak (1979) used white-white comparison groups
to assess the validity of both classical and latent-trait bias indices.
In this research we will use white-white comparison groups not only to
study the amount of "bias" due only to sampling errors but also to
establish numeric baseline values for interpreting bias indices that lack
distribution theory.

Artifactual problems associated with random sampling error will
be exacerbated when the groups to be compared differ in mean ability on
the test. Angoff (1982) suggested that the classical p-value method
would be more valid for detecting bias (rather than confounding differential
difficulty with item discrimination) if groups were equal or nearly
equal on the trait initially. For example, we might expect fewer artifactual
problems in most male-female comparisons than with black-white comparisons.
Even the three-parameter indices, which are theoretically sample invariant,
may be unstable when differences between groups are large. We know,
for example, that latent-trait equating procedures are more stable for
horizontal equatings (different tests, same grade) than
for vertical equatings (same test, different grades). The
vertical equating problem, where groups are located in very different
regions of the ability continuum, is analogous to bias studies where
groups have large mean differences. Again referring to classical
methods, Angoff (1982) suggested that an appropriate baseline for
interpreting bias indices would not be just randomly equivalent white
groups but white groups that differed in mean ability. This is the same
analysis strategy used by Jensen (1974) when he created pseudo-ethnic
groups, that is, white groups selected on age (with a mean difference
of two years) to simulate the average black-white differences. When we
know that the statistical techniques are intended to correct for group
differences but may do so imperfectly, the point is to simulate with all-
white data what the effect might be of mean differences only. In the
current research, pseudo-ethnic comparisons are used in addition to
randomly equivalent white groups.

Purpose Summary 

The substantive purpose of this research is to study item bias 
between black and white examinees on both a mathematics and vocabulary 
test. The theoretically preferred three-parameter IRT approach will be 
used with optimal techniques for computing bias indices based on previous 
research. The major focus of the research is methodological rather than 
substantive. To assess the amount of artifactual (i.e., spurious) bias
identified, both randomly equivalent white groups and extreme white groups
(pseudo-ethnic comparisons) will be used. To identify the particular
instances of unstable bias indices, cross-validation or replication
analyses will be performed with randomly equivalent black and white
groups. Finally, items found to be consistently biased will be inspected
for substantive characteristics. It is hypothesized that once artifactual
instances of bias are better controlled, the results should be more
interpretable than has been the case with previous bias studies.

Method 

Data Source 

The data used for this investigation are from the High School
and Beyond (HSB) data files available from the National Center for
Education Statistics.³ The HSB sample includes over 30,000 high school
sophomores and 28,000 seniors, from a representative probability
sample of the nation's tenth- and twelfth-grade populations. The test
and questionnaire data were collected in the spring of 1980 by the
National Opinion Research Center under contract with NCES. The
particular examinees selected for study were black and white seniors.
Unless otherwise specified (e.g., when pseudo-ethnic groups were created),
the subsamples used were selected at random from the larger group of
3377 black or 17,928 white seniors (excluding Hispanics). The following
study samples were created:

Math Test

  Comparison 1:  W1, B1             n = 1501 whites, 1500 blacks
  Comparison 2:  W2, B2             n = 1500 whites, 1500 blacks
                 (Sampling for comparison 1 and comparison 2 was
                 without replacement so the samples were independent.)
  Comparison 3:  W1, W2             (the white samples from comparisons
                                    1 and 2)
  Comparison 4:  B1, B2             (the black samples from comparisons
                                    1 and 2)
  Comparison 5:  W1, Pseudo B (W3)  (the white sample from comparison 1
                                    and a white sample, n = 1500,
                                    selected to match the distribution
                                    of B1 on math total score)

Vocabulary Test

  Comparison 1:  W4, B4             n = 1500 whites, 1500 blacks
  Comparison 2:  W5, B5             n = 1500 whites, 1500 blacks
  Comparison 3:  W4, W5             (the white samples from comparisons
                                    4 and 5)

³We are grateful to Dr. Samuel Peng and Jeffrey Owings for
providing a pupil file of individual item data as well as the publicly
available aggregate tapes for pupils, parents, and schools.
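
The sampling design above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the function names are
hypothetical, examinees are represented by arbitrary identifiers, and the
pseudo-ethnic group is matched to the target group by drawing the same
number of examinees at every total-score value, which is one plausible
reading of "selected to match the distribution."

```python
import random
from collections import Counter, defaultdict


def disjoint_samples(ids, n, k, seed=0):
    """Draw k non-overlapping random samples of size n (sampling without replacement)."""
    rng = random.Random(seed)
    drawn = rng.sample(list(ids), n * k)
    return [drawn[i * n:(i + 1) * n] for i in range(k)]


def pseudo_group(pool_scores, target_scores, seed=0):
    """Select members of a pool so that their score distribution equals the target's.

    pool_scores: dict mapping examinee id -> total score (e.g., the white pool)
    target_scores: list of total scores in the group to be imitated (e.g., B1)
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for examinee, score in pool_scores.items():
        by_score[score].append(examinee)
    chosen = []
    for score, count in Counter(target_scores).items():
        candidates = by_score[score]
        if len(candidates) < count:
            raise ValueError(f"pool has too few examinees with score {score}")
        chosen.extend(rng.sample(candidates, count))
    return chosen
```

In practice the pool passed to `pseudo_group` would exclude anyone already
drawn for the comparison sample, so that the two white groups remain
independent.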



⁴In keeping with the constraints of LOGIST, examinees were excluded
if their scores on the relevant test were 0% or 100%. Abilities (θ's)
cannot be accurately estimated for subjects who are at the ceiling of the
test; zero scores may not represent a valid administration and surely do
not indicate that the examinee has been "measured."

The two tests analyzed were the senior Mathematics and Vocabulary
tests. Although both tests were administered in two parts, we treated 
the combined item sets (32 in math and 27 on vocabulary) as single tests. 
More will be said about the nature of the items and the factor structure 
of the combined tests in the discussion of results. The math test was 
primarily a basic skills test involving simple operations, reading a 
graph, calculating a per unit cost, and comparing rates or distances. 
Four of the math items required some familiarity with basic algebra or 
geometry at a level that is usually included in K-8 curricula. The 
Vocabulary test was relatively more difficult; the average proportion of
seniors answering items correctly was .46 compared to .58 for math items.
The content of the vocabulary test was clearly aimed at a higher grade 
level, either high school or in some cases college level. The words are 
almost exclusively Latin roots. According to word frequency counts in 
representative materials for grades 1 to 9 (Carroll, Davies & Richman, 
1971) the vocabulary items are relatively unfamiliar. The standardized 
frequency indices indicate that the words are found either not at all 
in junior high level materials or at a rate of about one in a million 
or one in ten million words. Examples of words with similar frequencies 
would be: fanatic, recalcitrant, marauding, permeated, reciprocating, 
and crevasses. 

IRT Bias Method 

Item response theory permits the expression of examinee responses 
to individual test items as a function of the underlying ability or trait 
measured by the test. In the results section of the paper, there are
illustrations of these item characteristic curves (ICCs) or item response
functions. The horizontal dimension of the graph is the ability (or θ)
scale. For each item, a monotonic increasing curve reflects the probability
of getting the item correct for increasing values of θ. The ICC is defined
by three parameters: (1) the a parameter is proportional to the slope of
the curve at the inflection point and represents the item's discrimination;
(2) the b parameter reflects the item's difficulty and is a location on
the θ ability dimension (when there is no guessing, b is the point where
the probability of getting the item correct is 50%); and (3) the c parameter
is often referred to as the "pseudo-guessing" parameter. It is the lower
asymptote of the curve and represents the probability of getting the item
correct for examinees of extremely low ability.
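
For concreteness, the three-parameter logistic curve described above can
be written as a short Python function. The scaling constant of 1.7, which
brings the logistic close to the normal ogive, is an assumption of this
sketch; the text does not state which constant was used.

```python
from math import exp

D = 1.7  # assumed scaling constant; not stated in the text


def icc(theta, a, b, c, scale=D):
    """Probability of a correct response at ability theta under the 3PL model.

    a: discrimination, b: difficulty (location on theta), c: lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))
```

With c = 0 the curve passes through .50 at θ = b; with c > 0 the
probability at θ = b is c + (1 - c)/2, which is why b marks the 50% point
only when there is no guessing.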

The IRT method for detecting item bias is based on the comparison 
of item characteristic curves estimated separately for two groups. 
The ICCs reflect the probability of getting the item right as a 
function of ability. If an item is unbiased, examinees of equal ability 
should have equal probabilities of success on the item regardless of 
group membership; that is, the ICCs for different groups should be the 
same. If ICCs for two groups differ by more than sampling error, the 
item is apparently not measuring the same underlying trait for both 
groups (at least not to the same degree) and is therefore "biased." 

It should be noted that IRT models rest on an assumption of
unidimensionality, i.e., that the items in the test all assess the same underlying trait
and that only ability on that trait, not some other trait, influences
item performance. In the results section we present factor analyses as
supporting evidence that this assumption is met for these data. We did
not devote much attention to prior testing of this assumption, however,
because in a sense the bias studies themselves are addressed to the issue
of unidimensionality. In fact, multidimensionality will be detected as
bias by the IRT method so long as group differences are not uniform across
the different traits. As stated by Linn, Levine, Hastings and Wardrop
(1981), "Bias may generally be conceptualized as multidimensionality
confounding differences on a primary trait with differences on a secondary
trait" (p. 161).

The LOGIST program (Wood & Lord, 1976; Wood, Wingersky & Lord,
1976) was used to estimate the person abilities and item parameters.
Because the chance (c) parameters are difficult to estimate even with large
sample sizes, we followed the technique suggested by Lord (1980) whereby
c's were estimated in a combined analysis and then fixed at that value
for the separate analyses within ethnic groups. This aggregate or
composite analysis was done only at the level of each study comparison.
That is, black and white samples chosen for separate replications were
not combined to get even more stable estimates of c. Rather, we wished to
preserve the separateness of each comparison study and do each as if
it were the only data available to the researcher. Additional
information about how the LOGIST program was implemented is given in the
results section.

Scale Equating 

Once item parameters (defining the ICCs) have been estimated
separately for two ethnic groups, the ICCs for each item must be compared
to detect bias. However, because the θ scales from an IRT analysis are
arbitrary (set with mean = 0 and s = 1 for the given sample), the ICCs from
separate analyses are not directly comparable. The ICCs must first be
equated to the same scale. To make the adjustment, we used a linear
transformation of the b parameters as described in Linn, Levine, Hastings,
and Wardrop (1980, Appendix B). Briefly, the equating is determined by
a best fitting line which adjusts for the difference in average item b
values and has a slope equal to the ratio of the standard deviations of
the two sets of b's. In computing means and variances, b parameters were
weighted by the inverse of the variance error in estimating b. Therefore,
items with poorly estimated b's contributed least to the equating. Once
the linear equation was obtained, the b parameters for the second sample
were recomputed in the metric of the first group, e.g., in this case,
the black parameters were converted to the white scale.

After the b's were adjusted, the same equating constants (the
slope and intercept) were also used to transform the θ values. Finally,
the a parameters were equated using the inverse of the slope determined
for the b equating (Lord, 1980, p. 36). The c parameters do not require
equating.
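
The equating steps can be sketched in Python as below. The sketch is not
the procedure of Linn et al. (1980, Appendix B) itself: in particular, the
way the two groups' variance errors are combined into a single weight per
item (here, the inverse of their sum) is an assumption made for
illustration, and all function names are hypothetical.

```python
from math import sqrt


def weighted_mean_sd(values, weights):
    """Weighted mean and weighted standard deviation of a list of values."""
    total = sum(weights)
    m = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - m) ** 2 for w, v in zip(weights, values)) / total
    return m, sqrt(var)


def equating_constants(b_ref, b_other, var_ref, var_other):
    """Slope and intercept that place b_other on the scale of b_ref."""
    # Assumption: one weight per item, the inverse of the summed variance errors,
    # so that poorly estimated b's contribute least.
    weights = [1.0 / (vr + vo) for vr, vo in zip(var_ref, var_other)]
    m_ref, s_ref = weighted_mean_sd(b_ref, weights)
    m_oth, s_oth = weighted_mean_sd(b_other, weights)
    slope = s_ref / s_oth
    intercept = m_ref - slope * m_oth
    return slope, intercept


def item_to_reference_scale(a, b, c, slope, intercept):
    """Transform one item's parameters; a uses the inverse slope, c is unchanged."""
    return a / slope, slope * b + intercept, c


def theta_to_reference_scale(theta, slope, intercept):
    """Transform an ability estimate with the same constants used for b."""
    return slope * theta + intercept
```

Here the reference group would be the white sample and the other group the
black sample, matching the direction of conversion described above.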

Bias Indices 

For an individual test item, bias is defined as the difference 
in the probability of answering correctly, given equal ability. Once 
item characteristic curves have been adjusted to the same scale, differences 
in the probability of a correct response are synonymous with differences 
in the ICCs. Several different indices were used to quantify ICC
differences between groups.

Unsigned Indices

1. Unsigned Area (UA). As described in Shepard, Camilli
and Averill (1981), the area between two ICC functions
was evaluated as a definite integral for an item i:

$$ UA_i = \int_{-3}^{3} \left| P_{iW}(\theta) - P_{iB}(\theta) \right| \, d\theta $$

2. Sum of Squares 1 (SOS1). Linn, Levine, Hastings and Wardrop
(1980) developed both weighted and unweighted sums of squares
statistics. The following index is similar but is
"self weighting," in the sense that squared differences
in probabilities are summed for every value of θ that
occurs, rather than creating intervals on the θ scale
and using the midpoint of each interval. Thus, probability
differences in the region where the most data occur will
contribute more to the index.

$$ SOS1_i = \frac{1}{n_W + n_B} \sum_{j=1}^{n_W + n_B} \left[ P_{iW}(\theta_j) - P_{iB}(\theta_j) \right]^2 $$

3. Sum of Squares 2 (SOS2). SOS2 is similar to SOS1 except
that squared differences in probabilities were weighted
by the inverse of the variance error of the difference in
ICCs for each given value of θ.

$$ SOS2_i = \frac{1}{n_W + n_B} \sum_{j=1}^{n_W + n_B} \frac{\left[ P_{iW}(\theta_j) - P_{iB}(\theta_j) \right]^2}{\sigma^2_{P_{iW} - P_{iB}}} $$

Formulae for computing the variance error of a point on an
estimated ICC are given in Linn, Levine, Hastings and
Wardrop (Appendix A, 1980). Following their reasoning, P
differences contributed less to the weighted index if
either P were poorly estimated.

4. Chi-square (χ²). Lord (1980) proposed an asymptotic
significance test to compare a and b differences between
groups simultaneously. By the following chi-square formula,
the hypothesis is tested that the vector of a and b
differences is different from the vector (0, 0):

$$ \chi^2_i = V_i' \, S_i^{-1} \, V_i $$

where $V_i' = \{ a_{iW} - a_{iB},\; b_{iW} - b_{iB} \}$ and $S_i$ is the
sampling variance-covariance matrix of the a and b differences.
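
The four unsigned indices can be sketched numerically in Python as
follows. Several choices here are assumptions for illustration only: items
are passed as (a, b, c) tuples already on a common scale, the logistic
scaling constant is 1.7, the area is approximated with a midpoint rule
rather than evaluated in closed form, and the variance error of the ICC
difference is supplied by the caller as a function of θ (the formulae from
Linn et al., Appendix A, are not reproduced).

```python
from math import exp


def icc(theta, a, b, c, scale=1.7):
    """Three-parameter logistic ICC (scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))


def unsigned_area(item_w, item_b, lo=-3.0, hi=3.0, steps=600):
    """Midpoint-rule approximation to the area between two ICCs on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += abs(icc(theta, *item_w) - icc(theta, *item_b))
    return total * width


def sos1(item_w, item_b, thetas):
    """Mean squared ICC difference over every observed theta in both groups."""
    return sum((icc(t, *item_w) - icc(t, *item_b)) ** 2 for t in thetas) / len(thetas)


def sos2(item_w, item_b, thetas, var_diff):
    """As sos1, but each squared difference is divided by var_diff(theta)."""
    return sum((icc(t, *item_w) - icc(t, *item_b)) ** 2 / var_diff(t)
               for t in thetas) / len(thetas)


def lord_chi_square(d_a, d_b, var_a, var_b, cov_ab):
    """Quadratic form V' S^-1 V for the 2 x 2 case, written out explicitly."""
    det = var_a * var_b - cov_ab ** 2
    return (d_a * d_a * var_b - 2.0 * d_a * d_b * cov_ab + d_b * d_b * var_a) / det
```

The list `thetas` is meant to hold the estimated abilities of all
examinees in both groups, which is what makes SOS1 "self weighting."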

Signed Indices 

All of the above unsigned indices reflect the magnitude of 
the differences between ICCs for two groups, but they do not carry 
signs to indicate the direction of the bias, i.e., which group 
has the lower probability of a correct answer. In fact when the 
item characteristic curves cross, one group is not consistently 
disadvantaged. Rather, one group is "ahead" in one region of the 
graph, but behind in another region.

Signed indices are computed similarly to the corresponding
unsigned indices. When the ICCs do not cross, the absolute values 
of the indices are the same but with a sign attached to show the 
direction of bias. When the curves cross, "bias" in two regions 
of the curve may be offsetting. 



5. Signed Area (SA). When the ICCs for two groups did not
cross in the region from -3 to +3, the SA was equal to
the UA except that a negative sign was attached if the
item was biased against whites, i.e., if whites had a
lower probability of getting the item right given θ.
If the ICCs did cross, θ* was found as the root of the
equation P_W(θ) = P_B(θ). Then the integral was evaluated
from -3 to θ* and θ* to 3. The signed area was the
difference between these two areas and carried the
sign of the larger area.

6. Sum of Squares 3 (SOS3). SOS3 is the "signed sum of
squares" index analogous to SOS1. By multiplying
P_iW(θ) - P_iB(θ) times its absolute value, rather than
squaring the difference, the sign of the difference is
preserved.

$$ SOS3_i = \frac{1}{n_W + n_B} \sum_{j=1}^{n_W + n_B} \left[ P_{iW}(\theta_j) - P_{iB}(\theta_j) \right] \left| P_{iW}(\theta_j) - P_{iB}(\theta_j) \right| $$

7. Sum of Squares 4 (SOS4). SOS4 is the weighted sum parallel
to SOS2. It is computationally the same as SOS3 except
that every squared difference is weighted by the inverse
of the variance error of the difference.

$$ SOS4_i = \frac{1}{n_W + n_B} \sum_{j=1}^{n_W + n_B} \frac{\left[ P_{iW}(\theta_j) - P_{iB}(\theta_j) \right] \left| P_{iW}(\theta_j) - P_{iB}(\theta_j) \right|}{\sigma^2_{P_{iW} - P_{iB}}} $$

8. a and b differences (AD), (BD). Simple differences between
a parameters (a_W - a_B) and b parameters (b_W - b_B) were
computed. These differences were of interest because of
their relation to other bias indices. The a and b
differences were not interpreted as bias indices themselves
since separately they do not characterize ICC differences
well. As Linn et al. (1980) have shown, a and b
parameters could be substantially different for two groups
but not result in any practical differences in the ICCs.
This would be true, for example, if an item were extremely
difficult in both groups so the b's diverged in a θ region
where few examinees existed.
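
The signed indices can be sketched in Python in the same spirit. As
before, the (a, b, c) tuple layout, the 1.7 scaling constant, the midpoint
rule, and the caller-supplied variance function are assumptions of the
sketch. The signed area is computed here as the integral of the signed
difference P_W - P_B; with at most one crossing point this equals the
difference between the two areas on either side of θ*, carrying the sign
of the larger one, so no root finding is needed.

```python
from math import exp


def icc(theta, a, b, c, scale=1.7):
    """Three-parameter logistic ICC (scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))


def signed_area(item_w, item_b, lo=-3.0, hi=3.0, steps=600):
    """Midpoint-rule integral of P_W - P_B; negative when whites are disadvantaged."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += icc(theta, *item_w) - icc(theta, *item_b)
    return total * width


def sos3(item_w, item_b, thetas):
    """Mean of difference times absolute difference, which keeps the sign."""
    diffs = [icc(t, *item_w) - icc(t, *item_b) for t in thetas]
    return sum(d * abs(d) for d in diffs) / len(diffs)


def sos4(item_w, item_b, thetas, var_diff):
    """As sos3, with each term divided by the variance error var_diff(theta)."""
    total = 0.0
    for t in thetas:
        d = icc(t, *item_w) - icc(t, *item_b)
        total += d * abs(d) / var_diff(t)
    return total / len(thetas)
```

When two curves cross symmetrically, the signed area can be near zero even
though the unsigned area is large, which is the offsetting effect noted in
the text.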



Note that the black b was always subtracted from the white b,
since this corresponds to the order of subtraction in all the other
indices. However, because a high b means the reverse of a high
probability of correct response, the signs will have the opposite meaning.

Results and Discussion

Factor Analyses 

Factor analyses were performed on the mathematics and vocabulary
tests to determine if they could be considered unidimensional. Tetrachoric
correlations were obtained using the total senior sample of 25,069.
Principal factors were extracted after iterating for communalities. Each
factor with an eigenvalue greater than one was retained for rotation.
An oblique solution was obtained by direct oblimin transformation with 
A = 0. 

In the math test the first unrotated factor accounted for 30%
of the total variance. Four additional factors that met the minimum
eigenvalue criterion of one accounted for 5%, 4%, and 3% of the variance,
respectively. Similarly, on the vocabulary test the first unrotated factor
comprised 30% of the total variance. In this case, there were only two
additional factors, accounting for 6% and 4% of the variance, respectively.
For both tests we interpreted the results as reasonably strong evidence
of unidimensionality. First, the percentage of total variance explained
by the first factor exceeds Reckase's (1979) minimum of 20% needed to assure
stable item parameters. Also, an inspection of the scree plot of latent
roots suggested that only the first eigenvalues deviated from the gradual
rise that could be expected from factoring uncorrelated variables.
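The two screening rules used here (retain factors with eigenvalues greater than one; require the first factor to explain at least 20% of total variance) can be written out as a short check. This is a sketch only: the function name and the eigenvalues are hypothetical, and total variance is taken to equal the number of items, as it does for a correlation matrix.

```python
def summarize_factors(eigenvalues, n_items, min_eigenvalue=1.0, min_first_share=0.20):
    """Apply the eigenvalue-greater-than-one rule and the 20% first-factor rule.

    eigenvalues: largest-first eigenvalues (hypothetical values in the example).
    n_items: number of items, used as the total variance.
    """
    shares = [ev / n_items for ev in eigenvalues]
    n_retained = sum(1 for ev in eigenvalues if ev > min_eigenvalue)
    return {
        "shares": shares,
        "n_retained": n_retained,
        "first_factor_ok": shares[0] >= min_first_share,
    }

# Hypothetical 10-item test: the first factor explains 30% of total variance.
summary = summarize_factors([3.0, 1.2, 0.9], n_items=10)
```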

As stated previously, a trait that is measured as a single dimension
for the population generally may not be measured so well in particular
subgroups. It is the purpose of the bias analysis, in fact, to test whether
this unidimensionality holds for both black and white groups.

Equating and an Overall Picture of Bias 

Equating of a and b parameters from separate black and white analyses
is a procedural step leading to direct comparisons of the ICCs for the
two groups. The equating results are also useful in themselves, providing an
overall picture of how similar the test is for blacks and whites.

In Figure 1b a scattergram for 29 of the 32 math items depicts the
degree of relationship between the b parameters obtained for white sample
1 (W1) and black sample 1 (B1). (Items were excluded from the equating
if either a parameter or the standard error of a parameter could not
be estimated in either group.) The correlation between b parameters in
the first comparison was .97. A very similar graph and identical
correlation, r = .97, were obtained for the W2,B2 comparison. It is important,
however, to contrast Figure 1b with Figure 2b. Figure 2b shows the equating
for the white-white comparison study (W1,W2) (r = .996). In the second
scatterplot the b values differ only by sampling error. In contrast, the
greater dispersion in the black-white comparison suggests that for these
groups some of the item b's are more different than can be attributed to
sampling fluctuations, i.e., there is apparently some bias in the test.

Figures 1a and 2a are the corresponding scattergrams for a
parameters from the W1,B1 and W1,W2 analyses. In the first comparison the
white-black a parameters were correlated .78. In the parallel sample 2
white-black comparison (not shown), the a correlation was .81. These
correlations, obtained under conditions with some bias present, are in
contrast to the white-white correlation in Figure 2a of .95. It should be
noted that even in the white-white analysis the equivalence of the a's
was not as good as for the b parameters. This illustrates the relative
instability of the a estimates (and explains why we prefer to use a function
of the b equating to transform the a values). In the case of the black-white
comparison the modest a correlation may also suggest that a
highly correlated but slightly different trait is measured in the white
and black groups.

Figure 1a: Scattergram of a parameters (W1,B1)

Figure 2a: Scattergram of a parameters (W1,W2)
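One common way to place two sets of item parameters on a single scale, and to let the b equating determine the transformation of the a values, is a linear (mean and sigma) transformation. The sketch below uses that method as an assumption; the text does not state which equating function was actually used, and the function names and parameter values are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_target, b_source):
    """Slope A and intercept K that map source b's onto the target scale."""
    A = pstdev(b_target) / pstdev(b_source)
    K = mean(b_target) - A * mean(b_source)
    return A, K

def rescale_item(a, b, A, K):
    """Transform one item: b* = A*b + K, and a* = a/A (a follows from the b equating)."""
    return a / A, A * b + K

# Hypothetical b estimates for the same three items in two groups.
b_source = [-1.0, 0.0, 1.0]
b_target = [-1.5, 0.5, 2.5]
A, K = mean_sigma_constants(b_target, b_source)
a_star, b_star = rescale_item(1.0, 1.0, A, K)
```

Because the a transformation is derived from the more stable b estimates, instability in the a's does not feed back into the scale itself.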

For the methodological purpose of this research it is especially 
significant that there is apparently some bias between the measurement of 
blacks and whites on the math test. The equivalence of the bias results 
across parallel comparisons, the intercorrelation of bias indices, and the 
comparison to a white-white baseline are all informative only if there is 
bias to be detected. There is little to be learned about the consistency
or validity of the bias indices if they are applied in circumstances of 
no bias. 

The equating results for the Vocabulary Test followed the same
general pattern as that found on the math test. Overall, the adequacy
of estimation was not quite so good for the vocabulary test, and hence the
accuracy of the equating suffered. Even in the white-white analysis the
b's were correlated .99 but the a's only .83 (compared to .95 for a's
on the math test). For the two white-black comparisons the b's were
correlated .95 and .94, respectively; a's from the two groups were correlated
.80 and .56, respectively. The relatively poor correspondence between a
parameters in study 2 compared to study 1 suggests the presence of
estimation errors that are likely to turn up as artifactual (or at least
unstable) instances of bias.

Bias Indices with Replication for the Math Test

In Table 1, the bias indices are reported for mathematics items
across five comparisons. To reduce the amount of information, a reduced
set of indices is presented. The signed and unsigned areas and the
asymptotic chi-square have been the most popular in the past. As we will
see from the correlational results, the weighted and unweighted SOS
statistics are highly similar, but there is some evidence for preferring
the "behavior" of the weighted versions; therefore only SOS2 and SOS4
are shown.

The first two sets of indices, from Comparison 1 and Comparison 2,
are the replicated bias studies based on randomly equivalent groups of blacks
and whites. As will be discussed in the next section, a baseline for
judging the magnitude of the bias indices was obtained from the white-white
analysis. Index values that exceeded the largest number occurring
in the white-white analysis are starred as biased in Table 1.

There were a substantial number of items with ICC differences that
replicated across studies. Of the 29 math items for which ICCs were
estimated in both groups, 10 were consistently "biased." (Three of these
items were biased in favor of blacks.) We said items were consistently
biased if they exceeded the cut-off on four or five of the indices in both
studies. It is worth noting that fully one-third of the test items appear
to be deviant by this relatively stringent rule. When we caution that
item-bias methods are internal methods and hence unable to detect constant
bias, this does not imply that we are limited to finding only one or two
discrepant items.

Figures 3-6 are item characteristic curves for blacks and whites on
several illustrative math items. In Figure 3a, the graph for an unbiased
item is shown. The two solid lines reflect the probability of a correct
response, given θ, for blacks and whites. For all values of θ, the two
groups have essentially equal probabilities of answering correctly. Figure
3b is an example of a biased item. The white curve is consistently above
the black curve, so whites and blacks of equal ability do not have the same
probability of success. (The curves for item 7 were similarly discrepant
in comparison 2 with a slightly larger effect.)

The items in Figures 4a and 4b are also consistently biased in
both comparisons. These graphs are more typical of most of the biased
items in that the ICCs for the groups cross within the θ region of -2 to +2.
Therefore, the bias in one region of the curve is partially offset by
a reverse bias at the other end of the θ scale. Signed indices allow
this cancelling effect to occur and therefore only show a large bias
index if one group is overall more disadvantaged than the other. Even
between the signed indices there is a difference in how bias is quantified.
The signed area (SA) is a simple measure of the net signed area
between the curves. The signed SOS4 index is more heavily weighted
in regions where more examinees are concentrated. In Figure 4a both the
signed area and SOS4 index are large; whites have a considerable advantage
over blacks for θ's above -1. In Figure 4b, the areas of advantage and
disadvantage are more nearly equal, hence a near-zero signed area. The
SOS4 value for this same item is quite substantial, however, because more
examinees of both groups, especially blacks, are located in the vicinity
of -1 to 0 theta.

Figure 3b: Item 7, Comparison 1 (W1,B1)
Biased in both studies

Figure 4a: Item 6, Comparison 2 (W2,B2)
Biased in both comparisons

Figure 4b: Item 11, Comparison 1 (W1,B1)
Biased in both comparisons
ICCs cross, biased against blacks in region where most blacks are located.

Two examples of items biased against whites are shown in Figures 
5a and 5b, items 12 and 17 respectively. Two graphs from the parallel 
analyses are presented for item 17 to illustrate the replication results. 
The amount of similarity seen here between two independent but equivalent 
comparisons is fairly typical of the degree of stability found for 
consistently biased items (and for consistently unbiased items as well.) 

Item 30, in Figure 6, is an "artifactually biased" item. All of
the indices are substantially deviant in comparison 1 but not in comparison 2.
Item 30 is very difficult for both groups. Hence, the a and b parameters
must be estimated in a region where there is relatively little data. The
difficulty in estimation is reflected in large standard errors. It should
be noted, however, that even the statistics which take standard errors into
account (χ², SOS2, SOS4), and the SOS measures which deemphasize discrepancies
in regions with little data, had large values from this apparently spurious
bias. Note also that Item 30 can be seen in the scattergram of b's (Figure
1b) as a clear outlier; misestimation of b's in both groups had a compound
effect in comparison 1 that did not occur in comparison 2.

White-white and Black-black Comparison on the Math Test 

Item characteristic curves determined in two randomly equivalent
groups should differ only by sampling error. Comparison 3 in Table 1
contains the bias results for two samples of whites (W1,W2). Logically,
there should be no bias in this comparison and, indeed, inspection of these
data indicates that all of the indices are appreciably smaller than in the
white-black comparisons. Only item 30, which we know had estimation problems
in sample 1, stands out with relatively large values. (Still, the numbers
are much smaller than the corresponding indices in the between-ethnic
comparison.)

Figure 5a: Item 12, Comparison 2 (W2,B2)
Biased against whites in both studies

Figure 5b: Item 17, Comparison 1 and Comparison 2, respectively
Biased against whites in both studies (by different amounts)

Figure 6a: Item 30, Comparison 1 (W1,B1)
Biased in comparison 1 only

The non-zero values of each index in Comparison 3 indicate the 
ranges in magnitude that occur as sampling fluctuations. Therefore we used 
the largest value of each index occurring in study 3 as the cut-off for 
evaluating bias in the black-white studies. 
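The baseline rule just described amounts to taking, for each index, the largest value observed in the white-white comparison and flagging any item that exceeds it in a black-white comparison. A minimal sketch follows; the function names and index values are hypothetical, and comparing absolute values for signed indices is our assumption, not something stated in the text.

```python
def baseline_cutoff(white_white_values):
    """Largest magnitude of an index in the no-bias (white-white) comparison."""
    return max(abs(v) for v in white_white_values)

def flag_items(values_by_item, cutoff):
    """Items whose index magnitude exceeds the baseline cut-off."""
    return [item for item, v in values_by_item.items() if abs(v) > cutoff]

# Hypothetical values of one signed index.
cutoff = baseline_cutoff([0.01, -0.03, 0.02])           # white-white comparison
flagged = flag_items({7: 0.08, 12: -0.05, 3: 0.02}, cutoff)  # black-white comparison
```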

The stability of results in the two white samples was relatively
dramatic. Therefore, we wondered if the white-white comparison would
produce too stringent a baseline. It was conceivable that estimation problems
could be more difficult in the black group. Although all samples were
equal in size (n = 1500), something like a range restriction problem
in the black group could make parameters more unstable for this group. This
unreliability could then lead to spuriously large bias indices, especially
if the more stable white-white analysis were used as the baseline.

To test the above hypothesis we also conducted a black-black
"bias" analysis. Indeed, we did encounter more estimation problems than
with previous analyses. All but two item ICCs had been estimated for
sample B1 when c's were estimated in common with W1. However, standard
errors could not be estimated for more than a third of the items when
B1 was rerun with pooled c's from B2. We eventually were able to finesse
the LOGIST runs by inputting initial item parameters from the B1,B2
run and by raising the upper limits on a's. After these estimation
difficulties were resolved, however, ICC comparisons for the two black
samples (comparison 4, Table 1) did not result in a wholesale increase
in the number of large bias values. Therefore, we continued to use the
baseline values obtained in the white-white study.

Pseudo-ethnic Comparison for the Math Test 

It is conceivable that even IRT methods, which are theoretically
sample invariant, may be inadequate when differences between groups are
large. On the math total scores, blacks were .91σ below the white group.
To what extent might the apparent item discrepancies in Table 1 be due to
failure of the model to cope with mean differences in the separate ICC
analyses? To answer this question, we created a pseudo-black sample.
This group was selected at random from the original file of white
examinees but with the probability of being selected constrained to match
the relative frequency distribution of black total math scores. (We
recognize the circularity implicit in matching on the very test to be
analyzed; in a separate program of research we are using different sets
of background variables external to the test, e.g., SES factors and
instructional history, to study their effect on the issue of bias.) The
white sample matched to the black distribution can give us a rough idea
of the amount of deviance showing up in the bias indices solely as a
function of mean group differences and sampling error. Because of
regression effects on individual items, however, the W1,Pseudo B (W3)
comparison is not quite so extreme as the W1,B1 difference.
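The selection of a pseudo-black sample can be illustrated with a simple quota scheme: for each total score, draw white examinees at random in proportion to the relative frequency of that score among blacks. This is a sketch under assumptions; the function name, the rounding of quotas, and the toy score lists are hypothetical, and the actual selection procedure may have differed in detail.

```python
import random
from collections import Counter, defaultdict

def draw_matched_sample(white_scores, black_scores, n, seed=0):
    """Return indices of white examinees whose score distribution matches the black one."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for idx, score in enumerate(white_scores):
        by_score[score].append(idx)
    black_freq = Counter(black_scores)
    chosen = []
    for score, count in sorted(black_freq.items()):
        quota = round(n * count / len(black_scores))  # target count at this score
        pool = by_score.get(score, [])
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    return chosen

# Toy data: scores 0, 1, 2 with black relative frequencies .6, .3, .1.
white_scores = [0] * 50 + [1] * 50 + [2] * 50
black_scores = [0] * 6 + [1] * 3 + [2] * 1
sample = draw_matched_sample(white_scores, black_scores, n=20)
sample_dist = Counter(white_scores[i] for i in sample)
```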

The results of the pseudo-ethnic bias study are shown as comparison 5
in Table 1. Note that there are very few large indices. Therefore, the
large amount of deviance in the black-white analyses must be due to real
differences in the functioning of the items across groups rather than
being due to artifacts of the mean difference in math achievement.

The chi-square index produced the greatest number of large values
in the pseudo-ethnic comparison. For the four items where the χ² is starred
as biased but no other index exceeded its cutoff, there was in each case
a fairly big shift in the b parameter. As Linn et al. (1980) point out,
it is questionable whether differences only in b parameters should be
taken as evidence of bias. For these items the b shift is not reflected
in overall ICC differences, or else the other indices would have shown
large effects as well.

Correlations and Agreements Among Bias Indices 

In subsequent sections the bias analyses for the Vocabulary Test 
will be presented and the nature and importance of the apparent bias in 
the Math Test will be explored. Here, we wish to discuss some methodological 
issues regarding the functioning of the bias statistics. Results are 
presented for both tests to check on the generalizability of study findings.

To examine the relationships between indices, within-study
correlations were obtained for each comparison on each test. Tables 2 and
3 contain the within-comparison coefficients for the Math and Vocabulary
Tests, respectively. As we explained in previous work (Shepard, Camilli,
& Averill, 1981), Spearman rank-order correlations are preferred. With
the Pearson r, one very extreme item will occasionally inflate or obscure
the degree of relationship. When studying bias, congruence in the
identification of extreme items is of primary interest; therefore, we did not wish
to trim the distribution or eliminate outliers.
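The effect of a single extreme item on the two coefficients can be seen in a small constructed example. The data below are hypothetical: four items are ranked in exactly opposite order by two index sets, while a fifth item is extreme in both.

```python
from statistics import mean

def pearson(x, y):
    """Product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def spearman(x, y):
    """Rank-order correlation: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical index values for five items in two comparisons.
x = [1, 2, 3, 4, 50]
y = [4, 3, 2, 1, 60]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Here the Pearson r is above .99 because of the one extreme item, whereas the rank-order coefficient is zero.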

In both Table 2 and Table 3 the first two entries are for
comparisons where some bias is present, i.e., these are the between-ethnic
comparisons. The remaining entries reflect the degree of correspondence
between indices within a comparison where there are different amounts
of sampling instability but presumably no bias. Although one might expect
the correlations between indices to be higher in the presence of bias,
this is in fact not the case. The indices which are highly similar to
each other are similar whether they are quantifying extreme deviance
or only sampling perturbations. After all, whatever these sampling
fluctuations are, they are constant within a given comparison.

Table 2

Intercorrelation of Bias Indices Within Comparison
on the Math Test
(repeated for five comparisons)

Order of r's:   B1,W1          n's = 30 items
                B2,W2                29
                W1,W2                27
                B1,B2                27
                W1,Pseudo B          29

Index   Comparison       SOS4     AD     BD
SOS3    B1,W1             .79   -.20   -.82
        B2,W2             .77   -.02   -.70
        W1,W2             .81   -.41   -.38
        B1,B2             .57   -.13   -.65
        W1,Pseudo B       .74   -.04   -.84
SOS4    B1,W1                   -.01   -.44
        B2,W2                    .00   -.28
        W1,W2                   -.22    .01
        B1,B2                   -.55    .05
        W1,Pseudo B             -.22   -.46
AD      B1,W1                           .11
        B2,W2                          -.02
        W1,W2                           .19
        B1,B2                          -.17
        W1,Pseudo B                    -.11

[Entries for the UA, SOS1, SOS2, χ², and SA rows are not legible in the source.]

Table 3

Intercorrelation of Bias Indices Within Comparison
on the Vocabulary Test
(repeated for three comparisons)

Order of r's:   W4,B4          n = 25 items
                W5,B5
                W4,W5

Index Pair       SOS1   SOS2     χ²     SA   SOS3   SOS4     AD     BD
UA    W4,B4       .86    .59    .71    .28    .38    .27    .42   -.31
      W5,B5       .88    .82    .85   -.04    .11    .35   -.14    .13
      W4,W5       .91    .64    .95   -.01    .11    .06   -.16    .01
SOS1  W4,B4              .71    .81    .21    .31    .17    .45   -.?7
      W5,B5              .93    .94   -.05    .09    .17    .13    .04
      W4,W5              .78    .97    .01    .18    .24   -.05   -.05
SOS2  W4,B4                     .95   -.04    .10    .15    .33    .06
      W5,B5                     .88   -.04    .13    .34   -.02    .10
      W4,W5                     .76    .24    .40    .36    .10   -.24
χ²    W4,B4                            .06    .19    .16    .47   -.04
      W5,B5                           -.15    .02    .23    .21    .08
      W4,W5                            .04    .17    .16   -.08   -.07
SA    W4,B4                                   .96    .57    .21   -.96
      W5,B5                                   .90    .49   -.10   -.90
      W4,W5                                   .85    .12    .25   -.97
SOS3  W4,B4                                          .75    .21   -.91
      W5,B5                                          .65   -.16   -.84
      W4,W5                                          .44    .10   -.81
SOS4  W4,B4                                                -.11   -.47
      W5,B5                                                -.33   -.35
      W4,W5                                                -.13   -.07
AD    W4,B4                                                       -.18
      W5,B5                                                       -.12
      W4,W5                                                       -.20

The unsigned indices are highly correlated, suggesting they will
yield fairly redundant information. The signed indices are also
correlated with each other. However, the SOS4 statistic and the other
signed statistics are less highly intercorrelated than the unsigned indices.

It was on the basis of these within-study correlations that we
eliminated the simple sum-of-squares statistics from some of the results
tables. The SOS1 index is essentially redundant with both the unsigned
area and SOS2; SOS3 gives nearly the same picture of bias as the signed
area. Note also that the pattern of relationships among indices (across
comparisons of different types) was highly similar for both the math and
vocabulary tests.

The more important test of the stability and validity of the indices
as signs of bias is the pattern of correlations between study comparisons.
As indicated above, the within-study correlations show consistency from
both true and error sources of variance. The between-comparison
correlations are given in Tables 4 and 5 for the math and vocabulary tests,
respectively. Again, rank-order correlations were computed (but Pearson r's
are also given in parentheses). These coefficients reflect how highly
a bias index correlates with itself across study comparisons; that is,
how consistently does it rank the 29 math items studied?

Table 4

Correlations of Each Bias Index with Itself
Across Study Comparisons on the Math Test

Spearman r
(Pearson r)

Comparisons                   UA    SOS1    SOS2      χ²      SA    SOS3    SOS4      AD      BD
W1,B1 with W2,B2             .71     .70     .72     .80     .72     .83     .73     .83     .65
                           (.57)   (.66)   (.75)   (.65)   (.78)   (.89)   (.84)   (.61)   (.50)
W1,B1 with W1,W2             .33     .15     .08    -.02     .08     .27     .12    -.20     .22
                           (.53)   (.32)   (.29)  (-.14)   (.19)   (.05)   (.17)   (.00)   (.53)
W2,B2 with W1,W2             .27     .06    -.08     .08     .01     .06    -.15    -.46     .08
                           (.17)   (.02)  (-.05)  (-.02)   (.07)  (-.18)  (-.21)  (-.43)   (.39)
W1,B1 with B1,B2             .32     .11     .26     .17    -.11     .10    -.05    -.21    -.15
                           (.39)  (-.04)  (-.06)   (.50)  (-.08)   (.08)  (-.13)  (-.45)  (-.25)
W1,B1 with W1,Pseudo B       .32     .18     .36     .33     .49     .22     .18     .28     .41
                           (.44)   (.20)   (.15)   (.25)   (.55)   (.30)   (.23)   (.05)   (.29)
W1,W2 with W1,Pseudo B       .24     .22    -.24     .19     .17     .44     .37     .16     .21
                           (.27)   (.26)  (-.07)   (.27)   (.30)   (.46)   (.23)   (.07)   (.36)
W1,W2 with B1,B2             .?2     .27    -.02     .21     .42     .04    -.09    -.04     .09
                             (?)     (?)  (-.15)  (-.14)   (.47)  (-.14)  (-.08)  (-.01)   (.23)

Note: Only for the correlations between W1,B1 and W2,B2 is there the possibility
for agreement when bias is present. For other correlations, one or both
of the comparisons involved randomly equivalent groups or two white groups;
therefore, there should be no consistent bias. These latter pairs do share
some consistent errors, however, since in each case one of the samples is
repeated in both comparisons. Only in the correlations in the last row
(W1,W2 with B1,B2) should there be both no bias and no sample redundancy.

Table 5

Correlations of Each Bias Index with Itself Across
Study Comparisons on the Vocabulary Test

Spearman r
(Pearson r)

Comparisons                   UA    SOS1    SOS2      χ²      SA    SOS3    SOS4      AD      BD
W4,B4 with W5,B5             .60     .64     .86     .83     .63     .82     .84     .32     .56
                           (.80)   (.89)   (.71)   (.80)   (.74)   (.91)   (.8?)   (.49)   (.81)
W4,B4 with W4,W5             .60     .46     .18     .45     .00     .02     .12     .32    -.14
                           (.40)   (.10)   (.23)   (.18)   (.06)  (-.08)   (.23)   (.64)  (-.41)
W5,B5 with W4,W5             .61     .41     .24     .45    -.49    -.31    -.09    -.33    -.50
                           (.44)   (.17)   (.45)   (.31)  (-.49)  (-.34)  (-.50)  (-.37)  (-.37)

Note: Only for the correlations between W4,B4 and W5,B5 is there the possibility
for agreement when bias is present. The latter pairs do share some
consistent errors, however, since in both cases one of the white samples
is repeated in both comparisons.

In a sense these coefficients can be examined for convergent and
discriminant validity as in a multitrait-multimethod matrix. The first
line of each table is where we expect to see the effect of the trait on
the magnitude of the correlations. The trait is, of course, "bias" in
the test items or differential functioning of the items due to cultural
background. Only in the first row are the correlations between two
randomly equivalent ethnic comparisons. It is here that we would expect
to see consistency in the detection of bias. Indeed the degree of
relationship is quite good; r's for the math test range from .70 to .83, for the
vocabulary test they are on the order of .60 to .86. (Note that a and b
differences are presented to study their correspondence with other statistics,
but they are not interpreted as indices of bias.)

The subsequent rows in the between-group matrices contain correlations
where bias should not be the source of agreement. In all the remaining
rows one or both of the comparisons were between equivalent groups
(either both white or both black). These correlations should show discriminant
validity or the lack of method-specific correlations. These correlations
should be near zero, confirming a lack of bias when none exists conceptually.
However, it should be noted that these pairs of comparisons do share some
consistent errors since one sample is repeated in both comparisons. For
example, we expect the indices obtained in the W1,B1
study and those from the B1,B2 study to correlate zero. Bias can be
present in the first study but not the second. The two comparisons do,
however, share the B1 sample. Therefore the two studies could have some
consistent spurious "bias" based on sample characteristics. Only in the
last row of the math data (Table 4) are there correlations between conditions
where there should be both no bias and no consistent sampling error.

The discriminant coefficients show the reduced relationships
necessary to support the validity of the bias indices. For example, the
SOS2 statistic is correlated .72 with itself when bias is present in both
studies; it is correlated only .02 to .36 across studies where bias is not
the source of relationship. However, the pattern of high-trait, low-method
correlations is not so good for the unsigned indices on the vocabulary test.
Two reasons should be kept in mind: the vocabulary test is less biased
and, as we explained in previous research (Shepard, Camilli, & Averill,
1981), it is more difficult to show ranking consistency with unsigned
indices because they are one-tailed distributions. That is, unsigned
indices have both items biased against blacks and those biased against
whites in the same tail of the distribution, making it more difficult to
demonstrate consistency across parallel studies.

We are tentatively prepared to recommend the SOS2, SOS3 and SOS4
indices as the more valid indices of bias. Not only are these statistics
the most consistent in detecting bias in the ethnic comparisons, they
also intercorrelate the least in situations of no bias. A minor caveat
is warranted, however, regarding the two weighted measures (SOS2 and SOS4).
Because in our method of IRT estimation we fixed c's from a common analysis,
we assumed that standard errors for c were zero in the weighted SOS formulae.
To the extent that this assumption was erroneous, especially for very easy
items, the same false assumption could add spurious agreement to the
between-study consistency. We judged this effect to be very slight.
This problem could not, of course, explain the desirable drop-off in
correlations in contrasts where bias should not be present. The SOS3
statistic was not affected by this assumption.

Correlation coefficients are only a crude method for summarizing
the consistency of indices in identifying biased items. We are not actually
interested in the consistency with which unbiased items are ranked. Rather,
only the consistency at the extremes of the item rankings is important.
Using the cut-offs determined from the white-white analysis, items were
classified as either biased or not biased by each index. The contingency
tables in Table 6 show the consistency of these dichotomous classifications
from one black-white comparison to the other (on the Math Test). Here
it should be clear that the SOS2 and SOS4 are relatively the best, and in
an absolute sense, quite good at consistently classifying items as biased
or not biased. The χ² statistic is next-best in the amount of replicated
bias. But, as we explained earlier, the χ² can consistently identify
as biased items that have a large parameter difference but do not have a
commensurate probability difference for most sampled θ's. This occurs
especially when items have large b differences at the extreme ranges of θ.
The χ² index has the property of consistency but is less desirable on
other grounds.
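The agreement percentages summarize a 2 × 2 cross-classification of items (biased or not biased in comparison 1 by biased or not biased in comparison 2). A minimal sketch of that tally follows; the function name and the item flags are hypothetical and are not taken from Table 1.

```python
def classification_agreement(flags_1, flags_2):
    """Cross-tabulate two biased/not-biased classifications of the same items.

    Returns the 2 x 2 counts keyed by (flag in comparison 1, flag in comparison 2)
    and the percentage of items classified the same way in both comparisons.
    """
    counts = {(True, True): 0, (True, False): 0,
              (False, True): 0, (False, False): 0}
    for f1, f2 in zip(flags_1, flags_2):
        counts[(f1, f2)] += 1
    n = sum(counts.values())
    agree = counts[(True, True)] + counts[(False, False)]
    return counts, 100.0 * agree / n

# Hypothetical flags for five items in two parallel black-white comparisons.
comparison_1 = [True, True, False, False, True]
comparison_2 = [True, False, False, False, True]
counts, pct = classification_agreement(comparison_1, comparison_2)
```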

The agreement results found for the math test were only partially
duplicated on the vocabulary test. The percentages of agreement were
as follows: UA 70%; SOS1 75%; SOS2 90%; χ² 85%; SA 70%; SOS3 55%; SOS4 75%.
On the Vocabulary Test there was less bias; also on this test we had more
difficulty justifying a particular cut-off from the white-white analysis.

Table 6

Agreement of Indices in Equivalent White-Black
Comparisons on the Math Test

[Seven 2 × 2 contingency tables, Comparison 1 by Comparison 2, biased (B)
versus not biased (NB), one each for Unsigned Area, SOS1, SOS2, χ², Signed
Area, SOS3, and SOS4. The cell tallies are not legible in the source. The
agreement percentages that remain legible range from 79% to 97% but cannot
be matched reliably to individual indices.]

Note: These counts are based on the individual item data presented in Table 1.
Biased items, starred in Table 1, had indices for a given comparison
that exceeded the cut-off value determined from the white-white
comparison. For the χ² index, however, the critical value of 5.99 was
used here.

Substantive Interpretation of Bias Results
for the Math Test

The original premise motivating this research was that the results 
of item bias analyses would be more interpretable if statistical artifacts 
could be controlled. Specifically, we expected to see more of a pattern in 
test items found to be biased if we studied only those items that were 
cross validated, i.e., found to be deviant in parallel black-white 
comparisons. 

Once we had identified the consistently biased and unbiased items
on the math test, we looked at the actual test questions. It was immediately
obvious that the verbal math problem-solving was the source of the bias
against blacks. (To be honest, the cross-validation did little to clarify
this picture. The indices were consistent enough across studies that very
nearly the same insight would have been gained by looking only at the
results from one black-white comparison.)

All of the HSB math items (Part 1 and Part 2) had the following 
format (see footnote 6): 

Directions: Each problem in this section consists of two quantities, one placed in Column A and one 
in Column B. You are to compare the two quantities and mark oval 

A if the quantity in Column A is greater; 

B if the quantity in Column B is greater; 

C if the two quantities are equal; 

D if the size relationship cannot be determined from the information given. 

Sample Questions 

              Column A             Column B             Sample Answers 
Example 1.    20 per cent of 10    10 per cent of 20    C marked 
Example 2.    6 x 6                12 + 12              A marked 

Answer C is marked in Example 1 since the quantity in Column A is equal to the 
quantity in Column B. Answer A is marked for Example 2 since the quantity in Column A is 
greater than the quantity in Column B. 

6. We are grateful to Dr. Thomas Hilton and to the Educational Testing 
Service for providing us with copies of the test materials. We thank ETS 
and NCES for permission to reproduce the sample items and to create parallel 
item types to illustrate the nature of the verbal and numeric problems. 



We called items like Example 1 verbal and those like Example 2 numeric. 

A more realistic illustration of the verbal-type items is 
provided by the two following questions. These items were written to 
parallel two actual test questions that were found to be consistently 
biased against blacks: 

      Column A                        Column B 

1.    Number of centimeters           Number of centimeters 
      between -7 cm and +8 cm         between -8 cm and +7 cm 

2.    Cost per pound at a rate of     Cost per pound at a rate of 
      $4.00 for twenty pounds         3 pounds for 60¢ 

A type of numeric problem found to be consistently biased in favor of 
blacks was parallel to the following example: 

      Column A        Column B 

3.    326             3(10)3 + 2(10)^ + 6(10)   [exponents not legible in source] 

Numeric items that were consistently unbiased were similar to the following: 

4.    [illegible]     16 

5.    5a              6x 

The only numeric item found to be biased against blacks was comparable to 
this item: 

      33 V 5          37 V 5 

If questions had a verbal phrase in one column and a numeral in the other 
column, we called them V + N. The classification of math items as verbal 
or numeric was shown in Table 1. 

The following contingency table depicts the cross-tabulation of the 
bias results with item type: 

Table 7 

Bias Classification and Item Type for Math Items 

                                        Verbal    V+N    Numeric 
consistently biased against blacks 
possibly biased against blacks 
not biased 
possibly biased against whites 
consistently biased against whites 

[The tally marks in the cells of this table are not legible in the scanned source.] 

These data show a striking degree of relationship, suggesting that the bias 
indices are indeed sensitive to a change in meaning of the underlying trait 
for black examinees as measured by the verbal items. 

The foregoing conclusions have been rather enthusiastic. The bias 
indices, especially the SOS statistics, yield consistent results (with these 
sample sizes). They show appreciable discriminant validity between the biased 
and nonbiased studies. And, when the test questions themselves are examined, 
the indices seem to have signaled interpretable instances of differential 
performance. This enthusiasm must be tempered somewhat by the following result. 
In practical terms we wished to quantify the effect of having biased items in 
the test. Therefore, we rescored the math test deleting the seven items found 
to be consistently biased against blacks. We compared the new black and white 
means in the metric of the white standard deviation. The difference was .81σ. 
For the unexpurgated test it had been .91σ. The effect of the biased items 
(however consistent) is small but not trivial.* The relatively small 
magnitude of the bias effect can also be seen by examining the ICC 
graphs for typical biased items. Although the curves are discernibly 
different, the probability differences are not very large. Item 6, 
comparison 2, was selected for illustration (Figure 4a) because it had 
the very largest unsigned area statistic of all the biased items. At 
its height the probability difference between blacks and whites is .13. 
More typically the largest black-white difference on a biased item is 
only .05 to .10. This would mean on average roughly one more item correct 
for blacks if the biased items were removed. 

* The effect on black-white differences would have been smaller still, 
if we had deleted the three items biased against whites. However, the 
bias against whites was much less interpretable. 
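
The arithmetic behind these magnitudes can be sketched as follows. The sketch uses only figures stated in the text (seven biased items, largest probability gaps between .05 and .13, mean differences of .91 and .81 white standard deviations). The function names are hypothetical, and treating each item's largest gap as if it held at every ability level is a simplifying assumption that overstates the gain.

```python
N_BIASED_ITEMS = 7        # items consistently biased against blacks (from the text)
GAP_RANGE = (0.05, 0.13)  # reported range of the largest black-white probability gap


def score_gain_bounds(n_items, gap_range):
    """Bounds on the expected number-correct gain if every item-level gap vanished."""
    low, high = gap_range
    return n_items * low, n_items * high


def sd_unit_change(before, after):
    """Change in the standardized mean difference after rescoring."""
    return before - after


low_gain, high_gain = score_gain_bounds(N_BIASED_ITEMS, GAP_RANGE)
change = sd_unit_change(0.91, 0.81)

print(f"expected gain between {low_gain:.2f} and {high_gain:.2f} items")
print(f"mean difference reduced by {change:.2f} SD")
```

Under this simplification the gain lies between about a third of an item and just under one item, which is consistent with the "roughly one more item" figure read as an upper-end estimate.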
To illustrate further the practical import of the seven items 
biased against blacks, we also simulated the effect on failure rates if 
the test had been used to make pass-fail decisions as in a minimum- 
competency testing program. To establish comparable cut-off scores, raw 
scores were selected that would fail 10% of the whites on both the full 
and debiased tests. The corresponding failure rates for blacks on the 
two tests were 36.3% and 30.3% respectively. 
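
The pass-fail simulation can be sketched as below: choose the raw cut score that fails as close as possible to 10% of the reference (white) group, then apply the same cut to the other group. The score lists are small hypothetical samples, the function names are illustrative, and "fail" is assumed to mean scoring below the cut.

```python
def fail_rate(scores, cut):
    """Proportion of examinees scoring below the cut score."""
    return sum(s < cut for s in scores) / len(scores)


def cut_for_target(reference_scores, target_rate):
    """Raw cut score whose reference-group failure rate is closest to target_rate."""
    candidates = range(min(reference_scores), max(reference_scores) + 2)
    return min(candidates,
               key=lambda c: abs(fail_rate(reference_scores, c) - target_rate))


# Hypothetical integer raw scores, not the HSB data.
white_scores = [5, 8, 10, 10, 12, 12, 13, 14, 14, 15,
                15, 16, 16, 17, 18, 18, 19, 20, 21, 22]
black_scores = [3, 4, 6, 7, 8, 9, 10, 11, 12, 13]

cut = cut_for_target(white_scores, 0.10)
white_fail = fail_rate(white_scores, cut)
black_fail = fail_rate(black_scores, cut)
print(cut, white_fail, black_fail)
```

Running the same two steps on the full and the debiased test, each with its own cut score, gives the pair of failure rates being compared. Because raw scores are discrete, the reference-group rate will usually only approximate the 10% target.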

The finding that the overall effect of bias is small tempers both 
our methodological and substantive conclusions. To be sure, we must 
remember that internal bias indices cannot detect constant bias. Because 
the format of all the HSB problems requires some verbal reasoning we may 
have underestimated the effect of pervasive bias from this source. It is 
also plausible that a math achievement instrument developed for a national 
survey would be much less biased than many other tests. Because the bias 
results were consistent and interpretable in a test with a relatively small bias 
effect, we are inclined to believe that the indices are sensitive to 
relatively subtle but meaningful sources of bias. We expect that the 
desirable properties of the indices for bias detection would be enhanced 
in situations where there was a greater amount of bias. We would predict, 
for example, that in field trials of new test items, there would be more 
bias to be detected. 

Bias Results for the Vocabulary Test 

Bias indices for the Vocabulary Test are presented in Table 8. 
Again, Comparisons 1 and 2 are randomly equivalent black-white analyses. 
Comparison 3 is between two random samples of whites, a circumstance where 
there should be no bias. The largest values obtained in the white-white 
comparison were used as baselines for interpreting the size of indices in 
the between-ethnic comparisons. Because two items in the white-white analysis 
stood out as different from the typical range of values, the indices from 
the second-most discrepant item were used to establish the cut-offs. 
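
The baseline rule described here can be sketched in a few lines: rank the index values from the white-white comparison, skip the single most extreme one, and use the next value as the cut-off for the black-white comparisons. The function names and index values are hypothetical, and comparing absolute values (so that signed indices are handled too) is an assumption.

```python
def baseline_cutoff(null_indices, skip_top=1):
    """Cut-off from the no-bias comparison after skipping the most extreme values."""
    ranked = sorted((abs(v) for v in null_indices), reverse=True)
    return ranked[skip_top]


def flag_items(indices, cutoff):
    """Item numbers (1-based) whose index magnitude exceeds the cut-off."""
    return [i for i, v in enumerate(indices, start=1) if abs(v) > cutoff]


# Hypothetical index values for six items.
white_white = [0.4, 1.1, 6.8, 0.9, 2.3, 0.2]
black_white = [3.0, 0.5, 9.1, 2.3, 2.4, 1.0]

cutoff = baseline_cutoff(white_white)      # second-largest white-white value
flagged = flag_items(black_white, cutoff)
print(cutoff, flagged)
```

Skipping the top value keeps one outlying estimate in the no-bias comparison from setting a cut-off so high that nothing in the between-ethnic comparisons could exceed it.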

The methodological results from the Vocabulary Test were discussed 
earlier. Generally, they corroborated the findings based on the Math Test, 
but patterns were sometimes weaker because there was overall less internal 
bias in the Vocabulary Test. This test was very difficult for both groups. 

Table 8 

Signed and Unsigned Bias Indices for Vocabulary Items 
(in three Comparison Studies) 

Comparison 1: W4, B4      Comparison 2: W5, B5      Comparison 3: W4, W5 
Unsigned: UA, SOS2; Signed: SA, SOS4 (reported for each comparison) 

[The item-level index values in this table are not legible in the scanned source.] 

Note: To establish a baseline for judging the magnitude of the bias indices, the values from the second most 
deviant item in the white-white comparison were used. Indices that exceeded this cut-off in other 
comparisons are starred as "biased." For the sake of consistency the item 8, W1,W2 chi-square of 7.19 was used; 
however, 5.99 is the critical value for statistical significance at α = .05. 
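
The 5.99 figure can be checked directly. It is the .05 critical value of a chi-square distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-x/2). The degrees of freedom are inferred from that value rather than stated in this section, and the function names are illustrative.

```python
import math


def chi2_df2_critical(alpha):
    """Critical value of a chi-square with 2 df: solve exp(-x / 2) = alpha for x."""
    return -2.0 * math.log(alpha)


def chi2_df2_p_value(statistic):
    """Upper-tail probability of a chi-square with 2 df."""
    return math.exp(-statistic / 2.0)


critical = chi2_df2_critical(0.05)
p_for_7_19 = chi2_df2_p_value(7.19)
print(f"critical value: {critical:.2f}")
print(f"p-value for 7.19: {p_for_7_19:.3f}")
```

Under this assumption, using 7.19 instead of 5.99 as the cut-off corresponds to a nominal level of about .027 rather than .05, which makes it the more conservative criterion.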
Inspection of the content also suggested that the test was extremely 
unidimensional; for example, we could not categorize the words a priori as 
being more or less frequent in everyday language. All of the words seemed to 
have a literary flavor and were school and book oriented. Therefore, we 
were uncertain as to whether the analysis would detect differential 
difficulty. 

The consistently biased items seen in Table 8 are not immediately 
interpretable. Initially we conjectured that there might be some speed 
effects present in this test since the two parts had time limits of only 5 
and 4 minutes respectively. (Note that Part II starts with item 16.) 
However, there were not, in fact, appreciably different omitted or 
not-reached rates between the two groups. Four items appear to be consistently 
biased against blacks: items 4, 16, 17, and 18. This result was puzzling because 
these are consistently the easiest items in the test. Only three other 
items (#1, 3 and 5) are as easy (and #1 could not be estimated). Apparently 
there may be a floor effect here whereby blacks scoring near chance on many 
other items in the test cannot look as different on the very difficult items 
as they do on easy items. (Note that item 8 would have contradicted this trend 
since it is biased against blacks and is very difficult (P_W = .35; P_B = .20); 
however, we ignored item 8 because it was also "biased" in the white-white 
comparison.) 
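
The floor-effect conjecture can be illustrated with a small calculation. The sketch assumes a three-parameter logistic ICC with the usual 1.7 scaling constant. All parameter values are hypothetical, and the code is an illustration of the argument, not a reanalysis of the vocabulary data. The same difficulty shift between groups is applied to an easy and a hard item, and the resulting ICC gap is evaluated for a low-ability examinee.

```python
import math

D = 1.7  # scaling constant often used with the logistic ICC (assumption)


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def icc_gap(theta, a, b_reference, b_focal, c):
    """Reference-group minus focal-group probability at one ability level."""
    return icc_3pl(theta, a, b_reference, c) - icc_3pl(theta, a, b_focal, c)


LOW_THETA = -1.0  # a low-scoring examinee (hypothetical)
SHIFT = 0.5       # same difficulty shift applied to both items (hypothetical)

easy_gap = icc_gap(LOW_THETA, a=1.0, b_reference=-1.0, b_focal=-1.0 + SHIFT, c=0.2)
hard_gap = icc_gap(LOW_THETA, a=1.0, b_reference=2.0, b_focal=2.0 + SHIFT, c=0.2)
print(f"easy item gap: {easy_gap:.3f}, hard item gap: {hard_gap:.3f}")
```

With these values the gap is about .16 on the easy item and under .01 on the hard one. Near the lower asymptote both curves sit close to the chance level, so an identical parameter difference produces almost no visible ICC difference on a very difficult item.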

Summary and Conclusions 

The purpose of this research was to apply item response theory bias 
detection procedures to both a mathematics achievement and vocabulary test. 
Because the results of previous item-bias studies have often been 
uninterpretable, we wished to account for statistical artifacts by conducting 
cross-validation or replication studies. Therefore, each analysis was 
repeated on randomly equivalent samples of blacks and whites. Furthermore, 
to establish a baseline for judging bias indices that might be attributable 
only to sampling fluctuations, bias analyses were conducted comparing randomly 
selected groups of whites. Also, to assess the effect of mean group differences 
on the appearance of bias, pseudo-ethnic groups were created. That is, 
samples of whites selected to simulate the average black-white difference 
were also tested for bias. 

The validity and sensitivity of the IRT bias indices were supported 
by several findings: 

1. A relatively large number of items (10 of 29) on the Math Test 
were found to be consistently biased, i.e., the results were 
replicated in parallel analyses. (Seven were biased against 
blacks, three were biased against whites.) 

2. The bias indices were substantially smaller in white-white 
analyses. That is, with the exception of one or two estimation 
artifacts, indices did not find bias in situations of no bias. 

3. Furthermore, the indices (with the possible exception of chi-square) 
did not find bias in the pseudo-ethnic comparison. Therefore, 
bias by these methods is not an artifact of mean-group differences. 

4. The pattern of between study correlations showed high consistency 
between analyses where bias was plausibly present (e.g., between 
parallel ethnic comparisons). 

5. Also, the indices met the discriminant validity test. That is, 
the correlations were low between conditions where bias should 
not be present. 

6. For the math test, where a substantial number of items 
appeared biased, the results were interpretable. Verbal 
math problems were systematically biased against blacks. 

7. Finally, the desirable pattern of between comparison correlations 
was replicated on the Vocabulary Test, albeit somewhat weaker 
because of less bias on this measure. 

Overall the sums-of-squares statistics (weighted by the inverse of the 
variance errors) were judged to be the best indices for quantifying ICC 
differences between groups. Not only were these statistics the most 
consistent in detecting bias in the ethnic comparisons, they also intercorrelated 
the least in situations of no bias. Lord's (1980) asymptotic chi-square 
was consistent but was sometimes sensitive to parameter differences that 
did not have corresponding effects on ICC differences. 

When statistically biased items on the Math Test were examined, 
a strong relationship was found between the verbal properties of the item 
and bias classification. Most of the verbal problems on the test were 
biased against blacks, and with one exception numeric problems were not. 
This highly reliable and interpretable result had to be tempered by the 
finding that the magnitude of the bias effect was relatively small. When items 
biased against blacks were deleted and the test rescored, the difference 
between blacks and whites was changed from .91σ to .81σ. The 
bias indices are apparently sensitive to consistent but subtle effects. 
Presumably the validity evidence for the bias statistics would be 
increased in situations where there is greater bias, as in field tests of 
newly developed test items. We did not make substantive interpretations of 
bias findings for the Vocabulary Test. The amount of internal bias was much 
less for this instrument. 